Create training_logs.txt
training_logs.txt (added, +181 -0)
@@ -0,0 +1,181 @@
Warmup configured to 2,110 steps (~0.75 epoch, <= 1.00 epoch cap).
Step checkpoints configured every 1,000 optimizer steps.
Launching DDP run:
OMP_NUM_THREADS per rank: 2
/usr/local/bin/torchrun --standalone --nnodes=1 --nproc_per_node 2 /kaggle/input/datasets/chickaboomcmurtrie/magic-midi/training/train_variant_e_40m_ddp.py --pretokenized_manifest /kaggle/working/outputs/sub100m_e_40m_100k/source_pretokenized/manifest.json --pretokenized_root /kaggle/input/datasets/chickaboomcmurtrie/pulse88-tokenize-100k --output_dir /kaggle/working/outputs/sub100m_e_40m_100k --max_pieces 100000 --seed 42 --seed_length 512 --continuation_length 1536 --max_sequence_length 2048 --epochs 20 --batch_size 2 --grad_accumulation_steps 8 --learning_rate 0.0002 --warmup_ratio 0.03750444365446143 --warmup_steps 2110 --min_lr_ratio 0.1 --weight_decay 0.01 --label_smoothing 0.1 --max_grad_norm 1.0 --num_workers 4 --log_every_n_steps 50 --save_every_n_steps 1000 --save_every_n_epochs 5 --resume_mode remaining --no_auto_resume
[W412 20:47:10.890281599 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
/kaggle/input/datasets/chickaboomcmurtrie/magic-midi/model/mamba_block.py:17: UserWarning: mamba-ssm not available (requires CUDA). Using torch reference fallback for development. Install mamba-ssm on Colab/Kaggle for full performance.
warnings.warn(
/kaggle/input/datasets/chickaboomcmurtrie/magic-midi/model/mamba_block.py:21: UserWarning: mamba-ssm import details: No module named 'mamba_ssm'
warnings.warn(f"mamba-ssm import details: {exc}")
/kaggle/input/datasets/chickaboomcmurtrie/magic-midi/model/mamba_block.py:17: UserWarning: mamba-ssm not available (requires CUDA). Using torch reference fallback for development. Install mamba-ssm on Colab/Kaggle for full performance.
warnings.warn(
/kaggle/input/datasets/chickaboomcmurtrie/magic-midi/model/mamba_block.py:21: UserWarning: mamba-ssm import details: No module named 'mamba_ssm'
warnings.warn(f"mamba-ssm import details: {exc}")
[W412 20:47:19.363074762 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W412 20:47:19.369976685 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
return func(*args, **kwargs)
[rank0]:[W412 20:47:19.400861194 ProcessGroupNCCL.cpp:5138] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
Variant-E 40M DDP model params: 40,784,528 (40.78M)
Backend status: {'gdn_using_fallback': False, 'cfc_using_fallback': False}
[20:50:56] [INFO] Loaded pre-tokenized manifest: kept=86199 skipped_unresolved=0 skipped_short=13796
[20:50:56] [INFO] Loaded pre-tokenized manifest: kept=86199 skipped_unresolved=0 skipped_short=13796
/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:865: UserWarning: Memory Efficient attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:897.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:865: UserWarning: Memory Efficient attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:897.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
epoch=001 step=000050 train_loss=5.1351 lr=4.739336e-06
epoch=001 step=000100 train_loss=5.1077 lr=9.478673e-06
epoch=001 step=000150 train_loss=5.0852 lr=1.421801e-05
epoch=001 step=000200 train_loss=5.0700 lr=1.895735e-05
epoch=001 step=000250 train_loss=5.0517 lr=2.369668e-05
epoch=001 step=000300 train_loss=5.0310 lr=2.843602e-05
epoch=001 step=000350 train_loss=5.0036 lr=3.317536e-05
epoch=001 step=000400 train_loss=4.9425 lr=3.791469e-05
epoch=001 step=000450 train_loss=4.8817 lr=4.265403e-05
epoch=001 step=000500 train_loss=4.8229 lr=4.739336e-05
epoch=001 step=000550 train_loss=4.7634 lr=5.213270e-05
epoch=001 step=000600 train_loss=4.7001 lr=5.687204e-05
epoch=001 step=000650 train_loss=4.6303 lr=6.161137e-05
epoch=001 step=000700 train_loss=4.5589 lr=6.635071e-05
epoch=001 step=000750 train_loss=4.4806 lr=7.109005e-05
epoch=001 step=000800 train_loss=4.4025 lr=7.582938e-05
epoch=001 step=000850 train_loss=4.3220 lr=8.056872e-05
epoch=001 step=000900 train_loss=4.2364 lr=8.530806e-05
epoch=001 step=000950 train_loss=4.1570 lr=9.004739e-05
Step checkpoint saved: epoch=001 step=001000 loss~4.0672
epoch=001 step=001000 train_loss=4.0672 lr=9.478673e-05
epoch=001 step=001050 train_loss=3.9799 lr=9.952607e-05
epoch=001 step=001100 train_loss=3.8986 lr=1.042654e-04
epoch=001 step=001150 train_loss=3.8111 lr=1.090047e-04
epoch=001 step=001200 train_loss=3.7314 lr=1.137441e-04
epoch=001 step=001250 train_loss=3.6563 lr=1.184834e-04
epoch=001 step=001300 train_loss=3.5761 lr=1.232227e-04
epoch=001 step=001350 train_loss=3.5077 lr=1.279621e-04
epoch=001 step=001400 train_loss=3.4228 lr=1.327014e-04
epoch=001 step=001450 train_loss=3.3561 lr=1.374408e-04
epoch=001 step=001500 train_loss=3.2874 lr=1.421801e-04
epoch=001 step=001550 train_loss=3.2238 lr=1.469194e-04
epoch=001 step=001600 train_loss=3.1665 lr=1.516588e-04
epoch=001 step=001650 train_loss=3.0983 lr=1.563981e-04
epoch=001 step=001700 train_loss=3.0406 lr=1.611374e-04
epoch=001 step=001750 train_loss=2.9943 lr=1.658768e-04
epoch=001 step=001800 train_loss=2.9474 lr=1.706161e-04
epoch=001 step=001850 train_loss=2.8880 lr=1.753555e-04
epoch=001 step=001900 train_loss=2.8516 lr=1.800948e-04
epoch=001 step=001950 train_loss=2.8169 lr=1.848341e-04
Step checkpoint saved: epoch=001 step=002000 loss~2.7854
epoch=001 step=002000 train_loss=2.7854 lr=1.895735e-04
epoch=001 step=002050 train_loss=2.7574 lr=1.943128e-04
epoch=001 step=002100 train_loss=2.7148 lr=1.990521e-04
epoch=001 step=002150 train_loss=2.6977 lr=1.999997e-04
epoch=001 step=002200 train_loss=2.6649 lr=1.999983e-04
epoch=001 step=002250 train_loss=2.6285 lr=1.999960e-04
epoch=001 step=002300 train_loss=2.6148 lr=1.999925e-04
epoch=001 step=002350 train_loss=2.5961 lr=1.999881e-04
epoch=001 step=002400 train_loss=2.5664 lr=1.999826e-04
Epoch 001 | train_loss=3.7736 | val_loss=2.5591 | ppl=12.92
/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
return func(*args, **kwargs)
epoch=002 step=002450 train_loss=2.5690 lr=1.999761e-04
epoch=002 step=002500 train_loss=2.5478 lr=1.999686e-04
epoch=002 step=002550 train_loss=2.5284 lr=1.999600e-04
epoch=002 step=002600 train_loss=2.5071 lr=1.999505e-04
epoch=002 step=002650 train_loss=2.4980 lr=1.999398e-04
epoch=002 step=002700 train_loss=2.4797 lr=1.999282e-04
epoch=002 step=002750 train_loss=2.4798 lr=1.999155e-04
epoch=002 step=002800 train_loss=2.4632 lr=1.999018e-04
epoch=002 step=002850 train_loss=2.4440 lr=1.998870e-04
epoch=002 step=002900 train_loss=2.4474 lr=1.998712e-04
epoch=002 step=002950 train_loss=2.4423 lr=1.998544e-04
Step checkpoint saved: epoch=002 step=003000 loss~2.4265
#NOTE: This model was trained in two separate sessions. This is where session two began.
epoch=003 step=003050 train_loss=2.4215 lr=1.998177e-04
epoch=003 step=003100 train_loss=2.4032 lr=1.997978e-04
epoch=003 step=003150 train_loss=2.4000 lr=1.997769e-04
epoch=003 step=003200 train_loss=2.3878 lr=1.997549e-04
epoch=003 step=003250 train_loss=2.3915 lr=1.997319e-04
epoch=003 step=003300 train_loss=2.3817 lr=1.997079e-04
epoch=003 step=003350 train_loss=2.3815 lr=1.996829e-04
epoch=003 step=003400 train_loss=2.3769 lr=1.996568e-04
epoch=003 step=003450 train_loss=2.3592 lr=1.996297e-04
epoch=003 step=003500 train_loss=2.3582 lr=1.996016e-04
epoch=003 step=003550 train_loss=2.3433 lr=1.995724e-04
epoch=003 step=003600 train_loss=2.3343 lr=1.995422e-04
epoch=003 step=003650 train_loss=2.3458 lr=1.995110e-04
epoch=003 step=003700 train_loss=2.3448 lr=1.994788e-04
epoch=003 step=003750 train_loss=2.3167 lr=1.994455e-04
epoch=003 step=003800 train_loss=2.3137 lr=1.994112e-04
epoch=003 step=003850 train_loss=2.3265 lr=1.993759e-04
epoch=003 step=003900 train_loss=2.3250 lr=1.993396e-04
epoch=003 step=003950 train_loss=2.3083 lr=1.993022e-04
Step checkpoint saved: epoch=003 step=004000 loss~2.3077
epoch=003 step=004000 train_loss=2.3077 lr=1.992638e-04
epoch=003 step=004050 train_loss=2.3036 lr=1.992244e-04
epoch=003 step=004100 train_loss=2.2895 lr=1.991840e-04
epoch=003 step=004150 train_loss=2.2808 lr=1.991425e-04
epoch=003 step=004200 train_loss=2.2893 lr=1.991000e-04
epoch=003 step=004250 train_loss=2.2891 lr=1.990565e-04
epoch=003 step=004300 train_loss=2.2760 lr=1.990120e-04
epoch=003 step=004350 train_loss=2.2777 lr=1.989665e-04
epoch=003 step=004400 train_loss=2.2768 lr=1.989199e-04
epoch=003 step=004450 train_loss=2.2673 lr=1.988723e-04
epoch=003 step=004500 train_loss=2.2637 lr=1.988237e-04
epoch=003 step=004550 train_loss=2.2659 lr=1.987741e-04
epoch=003 step=004600 train_loss=2.2515 lr=1.987235e-04
epoch=003 step=004650 train_loss=2.2331 lr=1.986718e-04
epoch=003 step=004700 train_loss=2.2511 lr=1.986191e-04
epoch=003 step=004750 train_loss=2.2188 lr=1.985655e-04
epoch=003 step=004800 train_loss=2.2482 lr=1.985108e-04
epoch=003 step=004850 train_loss=2.2448 lr=1.984550e-04
epoch=003 step=004900 train_loss=2.2246 lr=1.983983e-04
epoch=003 step=004950 train_loss=2.2188 lr=1.983406e-04
Step checkpoint saved: epoch=003 step=005000 loss~2.2113
epoch=003 step=005000 train_loss=2.2113 lr=1.982818e-04
epoch=003 step=005050 train_loss=2.2207 lr=1.982220e-04
epoch=003 step=005100 train_loss=2.2103 lr=1.981613e-04
epoch=003 step=005150 train_loss=2.1865 lr=1.980995e-04
epoch=003 step=005200 train_loss=2.1997 lr=1.980367e-04
epoch=003 step=005250 train_loss=2.2105 lr=1.979729e-04
epoch=003 step=005300 train_loss=2.1968 lr=1.979080e-04
epoch=003 step=005350 train_loss=2.1893 lr=1.978422e-04
epoch=003 step=005400 train_loss=2.1870 lr=1.977754e-04
Epoch 003 | train_loss=2.2880 | val_loss=2.1674 | ppl=8.74
Epoch bundle saved: /kaggle/working/outputs/sub100m_e_40m_100k/kaggle_model_exports/epoch_003
/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
return func(*args, **kwargs)
epoch=004 step=005450 train_loss=2.1949 lr=1.977075e-04
epoch=004 step=005500 train_loss=2.1765 lr=1.976387e-04
epoch=004 step=005550 train_loss=2.1790 lr=1.975688e-04
epoch=004 step=005600 train_loss=2.1691 lr=1.974980e-04
epoch=004 step=005650 train_loss=2.1764 lr=1.974261e-04
epoch=004 step=005700 train_loss=2.1610 lr=1.973533e-04
epoch=004 step=005750 train_loss=2.1606 lr=1.972794e-04
epoch=004 step=005800 train_loss=2.1540 lr=1.972045e-04
epoch=004 step=005850 train_loss=2.1448 lr=1.971287e-04
epoch=004 step=005900 train_loss=2.1464 lr=1.970518e-04
epoch=004 step=005950 train_loss=2.1550 lr=1.969739e-04
Step checkpoint saved: epoch=004 step=006000 loss~2.1383
epoch=004 step=006000 train_loss=2.1383 lr=1.968951e-04
epoch=004 step=006050 train_loss=2.1607 lr=1.968152e-04
epoch=004 step=006100 train_loss=2.1325 lr=1.967344e-04
epoch=004 step=006150 train_loss=2.1383 lr=1.966525e-04
epoch=004 step=006200 train_loss=2.1425 lr=1.965697e-04
epoch=004 step=006250 train_loss=2.1288 lr=1.964859e-04
epoch=004 step=006300 train_loss=2.1408 lr=1.964011e-04
epoch=004 step=006350 train_loss=2.1292 lr=1.963152e-04
epoch=004 step=006400 train_loss=2.1468 lr=1.962284e-04
epoch=004 step=006450 train_loss=2.1370 lr=1.961406e-04
epoch=004 step=006500 train_loss=2.1126 lr=1.960519e-04
epoch=004 step=006550 train_loss=2.1325 lr=1.959621e-04
epoch=004 step=006600 train_loss=2.1084 lr=1.958714e-04
epoch=004 step=006650 train_loss=2.1124 lr=1.957796e-04
epoch=004 step=006700 train_loss=2.1193 lr=1.956869e-04
epoch=004 step=006750 train_loss=2.1147 lr=1.955932e-04
epoch=004 step=006800 train_loss=2.1131 lr=1.954985e-04
epoch=004 step=006850 train_loss=2.0983 lr=1.954029e-04
epoch=004 step=006900 train_loss=2.1052 lr=1.953062e-04
epoch=004 step=006950 train_loss=2.1126 lr=1.952086e-04
Step checkpoint saved: epoch=004 step=007000 loss~2.1097
epoch=004 step=007000 train_loss=2.1097 lr=1.951100e-04
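
The per-step lines in this log can be checked against the launch flags. The sketch below parses a step line and compares its learning rate with a plain linear warmup from 0 to `--learning_rate 0.0002` over `--warmup_steps 2110`. The logged warmup values match that formula to the printed precision, but the training script's actual scheduler is not shown here, so the formula is an assumption. `parse_step_line` and `linear_warmup_lr` are hypothetical helper names, not part of the training code.

```python
import re

# Matches step lines such as:
#   epoch=001 step=001000 train_loss=4.0672 lr=9.478673e-05
STEP_LINE = re.compile(
    r"epoch=(\d+) step=(\d+) train_loss=([0-9.]+) lr=([0-9.eE+-]+)"
)


def parse_step_line(line):
    """Return the fields of one step line as a dict, or None for other lines."""
    match = STEP_LINE.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match.group(1)),
        "step": int(match.group(2)),
        "train_loss": float(match.group(3)),
        "lr": float(match.group(4)),
    }


def linear_warmup_lr(step, base_lr=2e-4, warmup_steps=2110):
    """Assumed schedule: linear ramp from 0 to base_lr, then held at base_lr.

    Only the warmup phase is modeled; the decay after step 2110 is not.
    """
    return base_lr * min(step, warmup_steps) / warmup_steps


# Three lines copied from the log above.
sample_lines = [
    "epoch=001 step=000050 train_loss=5.1351 lr=4.739336e-06",
    "epoch=001 step=001000 train_loss=4.0672 lr=9.478673e-05",
    "epoch=001 step=002100 train_loss=2.7148 lr=1.990521e-04",
]

for line in sample_lines:
    record = parse_step_line(line)
    expected = linear_warmup_lr(record["step"])
    print(f"step={record['step']:06d} logged={record['lr']:.6e} expected={expected:.6e}")
```

Lines that are not step lines (warnings, checkpoint messages, epoch summaries) return `None`, so the same helper can be applied to the whole file with a simple filter.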