[2025-12-26 08:21:18,309] [INFO] [axolotl.utils.data.sft._load_raw_datasets:320] [PID:5436] Loading raw datasets...
[2025-12-26 08:21:21,409] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:5436] Loading dataset: darwinkernelpanic/luau_corpus_axolotl with base_type: completion and prompt_style: None
Tokenizing Prompts (num_proc=64): 100%|██████████████████████████████████████████████████| 22633/22633 [00:06<00:00, 3469.95 examples/s]
Dropping Long Sequences (>2048) (num_proc=64): 100%|██████████████████████████████████████████████████| 22640/22640 [00:01<00:00, 14654.26 examples/s]
Drop Samples with Zero Trainable Tokens (num_proc=64): 100%|██████████████████████████████████████████████████| 22640/22640 [00:01<00:00, 12773.18 examples/s]
Add position_id column (Sample Packing) (num_proc=64): 100%|██████████████████████████████████████████████████| 22640/22640 [00:01<00:00, 13969.82 examples/s]
Saving the dataset (64/64 shards): 100%|██████████████████████████████████████████████████| 22640/22640 [00:01<00:00, 16486.64 examples/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 4/4 [00:10<00:00, 2.67s/it]
[2025-12-26 08:23:32,340] [WARNING] [py.warnings._showwarnmsg:110] [PID:5436] <string>:204: RuntimeWarning: Mean of empty slice
[2025-12-26 08:26:22,565] [WARNING] [py.warnings._showwarnmsg:110] [PID:5436] /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:678: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[2025-12-26 08:26:40,186] [WARNING] [py.warnings._showwarnmsg:110] [PID:5436] /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:860: UserWarning: `_get_pg_default_device` will be deprecated, it only stays for backward-compatiblity reason. If you need to find a device for object collectives, please use `_get_object_coll_device`. If you need to query the device types supported by group, please use `_device_capability(group)`.
  warnings.warn(
[2025-12-26 08:26:40,186] [WARNING] [py.warnings._showwarnmsg:110] [PID:5436] /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:904: UserWarning: Multiple backends are registered with this ProcessGroup. We cannot determine which one is the default. Returning cpu. Please consider using other APIs.
  warnings.warn(