[2025-11-26 03:25:26,747] [INFO] [axolotl.utils.data.sft._load_raw_datasets:320] [PID:64100] Loading raw datasets...
[2025-11-26 03:25:29,268] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/limarp-augmented-train-last-only with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:29,972] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:30,754] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/mixed-medical-reasoning-formatted with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:32,584] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:33,613] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/kimi-stories-instruct with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:35,627] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: allura-forge/koto-instruct-sft-nothink with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:36,310] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:37,341] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/SpringDragon-Instruct with base_type: chat_template and prompt_style: None
Tokenizing Prompts (num_proc=24): 100%|██████████| 2535/2535 [00:13<00:00, 186.31 examples/s]
[2025-11-26 03:25:52,171] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:53,391] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/tulu-mini with base_type: chat_template and prompt_style: None
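Each of the datasets above is pulled from the Hugging Face Hub before being formatted with the chat template. A minimal sketch (not axolotl's internal loader) for inspecting one of them directly; the "train" split name is an assumption:

from datasets import load_dataset

ds = load_dataset("ToastyPigeon/tulu-mini", split="train")  # split name assumed
print(ds)     # column names and row count
print(ds[0])  # first example, expected to hold a chat-style message list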
Tokenizing Prompts (num_proc=24): 100%|██████████| 43790/43790 [00:11<00:00, 3719.04 examples/s]
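The "Tokenizing Prompts" passes render each conversation through the tokenizer's chat template. A rough illustration, assuming each row has a `messages` list of {"role", "content"} dicts; axolotl's chat_template strategy also masks non-assistant tokens in `labels`, which is omitted here:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-base-model")  # placeholder model id

def tokenize_row(row):
    # Render the conversation with the model's chat template and tokenize it.
    input_ids = tokenizer.apply_chat_template(row["messages"], tokenize=True)
    return {"input_ids": input_ids, "labels": list(input_ids)}

tokenized = ds.map(tokenize_row, num_proc=24, remove_columns=ds.column_names)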
Dropping Long Sequences (>4096) (num_proc=24): 100%|██████████| 132318/132318 [00:03<00:00, 33596.79 examples/s]
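The "Dropping Long Sequences (>4096)" pass is a length filter matching a 4096-token sequence length. Continuing the tokenized dataset from the sketch above:

MAX_LEN = 4096
# Keep only rows that fit within the configured sequence length.
tokenized = tokenized.filter(lambda row: len(row["input_ids"]) <= MAX_LEN, num_proc=24)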
Drop Samples with Zero Trainable Tokens (num_proc=24): 100%|██████████| 123829/123829 [00:03<00:00, 34971.15 examples/s]
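With the usual -100 label mask, a sample contributes nothing to the loss if every label is masked, so such rows are dropped. A minimal equivalent, continuing the sketch above:

# Remove rows whose labels are entirely -100 (no trainable tokens).
tokenized = tokenized.filter(
    lambda row: any(label != -100 for label in row["labels"]),
    num_proc=24,
)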
Add position_id column (Sample Packing) (num_proc=24): 100%|██████████| 123829/123829 [00:06<00:00, 20598.27 examples/s]
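For sample packing, each example gets position ids counting from zero so that positions restart at every example boundary when several examples are packed into one sequence. A sketch of that step (the exact column name axolotl uses internally may differ):

# Per-example position ids, restarting at 0 for each packed sample.
tokenized = tokenized.map(
    lambda row: {"position_ids": list(range(len(row["input_ids"])))},
    num_proc=24,
)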
Saving the dataset (24/24 shards): 100%|██████████| 123829/123829 [00:00<00:00, 149566.99 examples/s]
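The prepared dataset is then cached to disk as 24 Arrow shards. A direct `datasets` equivalent, with a placeholder output path:

# Cache the fully prepared dataset; axolotl writes it under last_run_prepared/.
tokenized.save_to_disk("last_run_prepared/your-hash", num_shards=24)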
Using Liger RMSNorm!
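"Using Liger RMSNorm!" indicates the Liger kernel patch is active for the norm layers. Outside of axolotl you would opt in via liger-kernel's patch helpers; shown here for a Llama-style model, which is an assumption about the model family:

from liger_kernel.transformers import apply_liger_kernel_to_llama

# Swap the model's RMSNorm for Liger's fused kernel (other fusions left at defaults).
apply_liger_kernel_to_llama(rms_norm=True)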
Loading checkpoint shards: 100%|██████████| 6/6 [00:00<00:00, 519.79it/s]
Loading checkpoint shards: 100%|██████████| 6/6 [00:00<00:00, 544.77it/s]
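The "Loading checkpoint shards" bars correspond to loading a model whose weights are split across six files. A plain transformers equivalent; the model id and dtype are assumptions:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",           # placeholder model id
    torch_dtype=torch.bfloat16,  # dtype assumed
)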
[2025-11-26 14:54:12,836] [WARNING] [py.warnings._showwarnmsg:110] [PID:64100] /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:680: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict . Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
| warnings.warn( | |
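The deprecation warning points at the newer distributed checkpoint API. A small sketch of the suggested replacement; `model` and `optimizer` are placeholders for the FSDP-wrapped model and its optimizer:

from torch.distributed.checkpoint.state_dict import get_state_dict

# Gather model and optimizer state dicts in a parallelism-agnostic way.
model_state_dict, optimizer_state_dict = get_state_dict(model, optimizer)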