[2025-11-26 03:25:26,747] [INFO] [axolotl.utils.data.sft._load_raw_datasets:320] [PID:64100] Loading raw datasets...
[2025-11-26 03:25:29,268] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/limarp-augmented-train-last-only with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:29,972] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:30,754] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/mixed-medical-reasoning-formatted with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:32,584] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:33,613] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/kimi-stories-instruct with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:35,627] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: allura-forge/koto-instruct-sft-nothink with base_type: chat_template and prompt_style: None
[2025-11-26 03:25:36,310] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:37,341] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/SpringDragon-Instruct with base_type: chat_template and prompt_style: None

Tokenizing Prompts (num_proc=24): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2535/2535 [00:13<00:00, 186.31 examples/s]
[2025-11-26 03:25:52,171] [WARNING] [huggingface_hub.repocard.content:108] [PID:64100] Repo card metadata block was not found. Setting CardData to empty.
[2025-11-26 03:25:53,391] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:64100] Loading dataset: ToastyPigeon/tulu-mini with base_type: chat_template and prompt_style: None

Tokenizing Prompts (num_proc=24): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 43790/43790 [00:11<00:00, 3719.04 examples/s]

Dropping Long Sequences (>4096) (num_proc=24): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 132318/132318 [00:03<00:00, 33596.79 examples/s]

Drop Samples with Zero Trainable Tokens (num_proc=24): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 123829/123829 [00:03<00:00, 34971.15 examples/s]

Add position_id column (Sample Packing) (num_proc=24): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 123829/123829 [00:06<00:00, 20598.27 examples/s]

Saving the dataset (24/24 shards): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 123829/123829 [00:00<00:00, 149566.99 examples/s]
Using Liger RMSNorm!

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 6/6 [00:00<00:00, 519.79it/s]

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 6/6 [00:00<00:00, 544.77it/s]
[2025-11-26 14:54:12,836] [WARNING] [py.warnings._showwarnmsg:110] [PID:64100] /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:680: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(