Qwen3-8B-test / debug.log
[2025-09-29 16:16:56,308] [INFO] [axolotl.utils.data.sft._load_raw_datasets:320] [PID:23243] Loading raw datasets...
[2025-09-29 16:16:56,541] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:23243] Loading dataset: /workspace/outputs/training_data/ with base_type: chat_template and prompt_style: None
Dropping Long Sequences (>1024) (num_proc=192): 100%|██████████| 1918/1918 [00:04<00:00, 410.73 examples/s]
Drop Samples with Zero Trainable Tokens (num_proc=192): 100%|██████████| 1918/1918 [00:06<00:00, 307.84 examples/s]
Add position_id column (Sample Packing) (num_proc=192): 100%|██████████| 1918/1918 [00:04<00:00, 407.87 examples/s]
Saving the dataset (7/7 shards): 100%|██████████| 1918/1918 [00:00<00:00, 6023.49 examples/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 1.70it/s]