[2025-11-26 03:25:26,747] [INFO] [axolotl.utils.data.sft._load_raw_datasets:320] [PID:64100] Loading raw datasets...
[2025-11-26 03:25:29,268] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/limarp-augmented-train-last-only with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:29,972] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:30,754] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/mixed-medical-reasoning-formatted with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:32,584] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:33,613] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/kimi-stories-instruct with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:35,627] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: allura-forge/koto-instruct-sft-nothink with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:36,310] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:37,341] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/SpringDragon-Instruct with base_type: chat_template and prompt_style: None
Tokenizing Prompts (num_proc=24): 100%|██████████| 2535/2535 [00:13<00:00, 186.31 examples/s]
[2025-11-26 03:25:52,171] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:53,391] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/tulu-mini with base_type: chat_template and prompt_style: None
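Each of the datasets above is pulled from the Hugging Face Hub before being formatted with the chat template. A minimal sketch (not axolotl's internal loader) for inspecting one of them directly; the "train" split name is an assumption:

from datasets import load_dataset

ds = load_dataset("ToastyPigeon/tulu-mini", split="train")  # split name assumed
print(ds)     # column names and row count
print(ds[0])  # first example, expected to hold a chat-style message list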
Tokenizing Prompts (num_proc=24): 100%|██████████| 43790/43790 [00:11<00:00, 3719.04 examples/s]
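The "Tokenizing Prompts" passes render each conversation through the tokenizer's chat template. A rough illustration, assuming each row has a `messages` list of {"role", "content"} dicts; axolotl's chat_template strategy also masks non-assistant tokens in `labels`, which is omitted here:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-base-model")  # placeholder model id

def tokenize_row(row):
    # Render the conversation with the model's chat template and tokenize it.
    input_ids = tokenizer.apply_chat_template(row["messages"], tokenize=True)
    return {"input_ids": input_ids, "labels": list(input_ids)}

tokenized = ds.map(tokenize_row, num_proc=24, remove_columns=ds.column_names)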
Dropping Long Sequences (>4096) (num_proc=24): 100%|██████████| 132318/132318 [00:03<00:00, 33596.79 examples/s]
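The "Dropping Long Sequences (>4096)" pass is a length filter matching a 4096-token sequence length. Continuing the tokenized dataset from the sketch above:

MAX_LEN = 4096
# Keep only rows that fit within the configured sequence length.
tokenized = tokenized.filter(lambda row: len(row["input_ids"]) <= MAX_LEN, num_proc=24)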
Drop Samples with Zero Trainable Tokens (num_proc=24): 100%|██████████| 123829/123829 [00:03<00:00, 34971.15 examples/s]
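With the usual -100 label mask, a sample contributes nothing to the loss if every label is masked, so such rows are dropped. A minimal equivalent, continuing the sketch above:

# Remove rows whose labels are entirely -100 (no trainable tokens).
tokenized = tokenized.filter(
    lambda row: any(label != -100 for label in row["labels"]),
    num_proc=24,
)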
Add position_id column (Sample Packing) (num_proc=24): 100%|██████████| 123829/123829 [00:06<00:00, 20598.27 examples/s]
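For sample packing, each example gets position ids counting from zero so that positions restart at every example boundary when several examples are packed into one sequence. A sketch of that step (the exact column name axolotl uses internally may differ):

# Per-example position ids, restarting at 0 for each packed sample.
tokenized = tokenized.map(
    lambda row: {"position_ids": list(range(len(row["input_ids"])))},
    num_proc=24,
)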
Saving the dataset (24/24 shards): 100%|██████████| 123829/123829 [00:00<00:00, 149566.99 examples/s]
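The prepared dataset is then cached to disk as 24 Arrow shards. A direct `datasets` equivalent, with a placeholder output path:

# Cache the fully prepared dataset; axolotl writes it under last_run_prepared/.
tokenized.save_to_disk("last_run_prepared/your-hash", num_shards=24)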
Using Liger RMSNorm!
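"Using Liger RMSNorm!" indicates the Liger kernel patch is active for the norm layers. Outside of axolotl you would opt in via liger-kernel's patch helpers; shown here for a Llama-style model, which is an assumption about the model family:

from liger_kernel.transformers import apply_liger_kernel_to_llama

# Swap the model's RMSNorm for Liger's fused kernel (other fusions left at defaults).
apply_liger_kernel_to_llama(rms_norm=True)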
Loading checkpoint shards: 100%|██████████| 6/6 [00:00<00:00, 519.79it/s]
Loading checkpoint shards: 100%|██████████| 6/6 [00:00<00:00, 544.77it/s]
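The "Loading checkpoint shards" bars correspond to loading a model whose weights are split across six files. A plain transformers equivalent; the model id and dtype are assumptions:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",           # placeholder model id
    torch_dtype=torch.bfloat16,  # dtype assumed
)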
[2025-11-26 14:54:12,836] [WARNING] [py.warnings._showwarnmsg:110] [PID:64100] /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:680: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict . Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
| warnings.warn( | |
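The deprecation warning points at the newer distributed checkpoint API. A small sketch of the suggested replacement; `model` and `optimizer` are placeholders for the FSDP-wrapped model and its optimizer:

from torch.distributed.checkpoint.state_dict import get_state_dict

# Gather model and optimizer state dicts in a parallelism-agnostic way.
model_state_dict, optimizer_state_dict = get_state_dict(model, optimizer)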