Chickaboo committed
Commit 5be4e9b · verified · 1 Parent(s): 2fea5ee

Create training_logs.txt

Files changed (1)
  1. training_logs.txt +181 -0
training_logs.txt ADDED
@@ -0,0 +1,181 @@
+ Warmup configured to 2,110 steps (~0.75 epoch, <= 1.00 epoch cap).
+ Step checkpoints configured every 1,000 optimizer steps.
+ Launching DDP run:
+ OMP_NUM_THREADS per rank: 2
+ /usr/local/bin/torchrun --standalone --nnodes=1 --nproc_per_node 2 /kaggle/input/datasets/chickaboomcmurtrie/magic-midi/training/train_variant_e_40m_ddp.py --pretokenized_manifest /kaggle/working/outputs/sub100m_e_40m_100k/source_pretokenized/manifest.json --pretokenized_root /kaggle/input/datasets/chickaboomcmurtrie/pulse88-tokenize-100k --output_dir /kaggle/working/outputs/sub100m_e_40m_100k --max_pieces 100000 --seed 42 --seed_length 512 --continuation_length 1536 --max_sequence_length 2048 --epochs 20 --batch_size 2 --grad_accumulation_steps 8 --learning_rate 0.0002 --warmup_ratio 0.03750444365446143 --warmup_steps 2110 --min_lr_ratio 0.1 --weight_decay 0.01 --label_smoothing 0.1 --max_grad_norm 1.0 --num_workers 4 --log_every_n_steps 50 --save_every_n_steps 1000 --save_every_n_epochs 5 --resume_mode remaining --no_auto_resume
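A quick arithmetic check of the launch flags above, as a minimal sketch. It assumes `--batch_size` is per rank, which the log does not state, and the variable names are illustrative only. Dividing `--warmup_steps` by `--warmup_ratio` gives the total step count that ratio implies; whether the script actually schedules against that count is not shown in this log.

```python
# Values copied from the torchrun command line above.
batch_size = 2                 # assumed to be per rank (per GPU)
grad_accumulation_steps = 8
nproc_per_node = 2
warmup_steps = 2110
warmup_ratio = 0.03750444365446143
epochs = 20

# Sequences contributing to one optimizer step across both ranks.
sequences_per_optimizer_step = batch_size * grad_accumulation_steps * nproc_per_node

# Total optimizer steps implied by warmup_steps / warmup_ratio.
implied_total_steps = round(warmup_steps / warmup_ratio)
implied_steps_per_epoch = implied_total_steps / epochs

# Warmup length expressed in epochs, for comparison with the "~0.75 epoch" line.
warmup_epochs = warmup_steps / implied_steps_per_epoch

print(sequences_per_optimizer_step)
print(implied_total_steps)
print(round(warmup_epochs, 2))
```

Under these assumptions the warmup length comes out at roughly three quarters of an epoch, which agrees with the "~0.75 epoch" message at the top of the log.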
+ [W412 20:47:10.890281599 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
+ /kaggle/input/datasets/chickaboomcmurtrie/magic-midi/model/mamba_block.py:17: UserWarning: mamba-ssm not available (requires CUDA). Using torch reference fallback for development. Install mamba-ssm on Colab/Kaggle for full performance.
+ warnings.warn(
+ /kaggle/input/datasets/chickaboomcmurtrie/magic-midi/model/mamba_block.py:21: UserWarning: mamba-ssm import details: No module named 'mamba_ssm'
+ warnings.warn(f"mamba-ssm import details: {exc}")
+ /kaggle/input/datasets/chickaboomcmurtrie/magic-midi/model/mamba_block.py:17: UserWarning: mamba-ssm not available (requires CUDA). Using torch reference fallback for development. Install mamba-ssm on Colab/Kaggle for full performance.
+ warnings.warn(
+ /kaggle/input/datasets/chickaboomcmurtrie/magic-midi/model/mamba_block.py:21: UserWarning: mamba-ssm import details: No module named 'mamba_ssm'
+ warnings.warn(f"mamba-ssm import details: {exc}")
+ [W412 20:47:19.363074762 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
+ [W412 20:47:19.369976685 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
+ /usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
+ return func(*args, **kwargs)
+ [rank0]:[W412 20:47:19.400861194 ProcessGroupNCCL.cpp:5138] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
+ Variant-E 40M DDP model params: 40,784,528 (40.78M)
+ Backend status: {'gdn_using_fallback': False, 'cfc_using_fallback': False}
+ [20:50:56] [INFO] Loaded pre-tokenized manifest: kept=86199 skipped_unresolved=0 skipped_short=13796
+ [20:50:56] [INFO] Loaded pre-tokenized manifest: kept=86199 skipped_unresolved=0 skipped_short=13796
+ /usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:865: UserWarning: Memory Efficient attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:897.)
+ return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
+ /usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:865: UserWarning: Memory Efficient attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:897.)
+ return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
+ epoch=001 step=000050 train_loss=5.1351 lr=4.739336e-06
+ epoch=001 step=000100 train_loss=5.1077 lr=9.478673e-06
+ epoch=001 step=000150 train_loss=5.0852 lr=1.421801e-05
+ epoch=001 step=000200 train_loss=5.0700 lr=1.895735e-05
+ epoch=001 step=000250 train_loss=5.0517 lr=2.369668e-05
+ epoch=001 step=000300 train_loss=5.0310 lr=2.843602e-05
+ epoch=001 step=000350 train_loss=5.0036 lr=3.317536e-05
+ epoch=001 step=000400 train_loss=4.9425 lr=3.791469e-05
+ epoch=001 step=000450 train_loss=4.8817 lr=4.265403e-05
+ epoch=001 step=000500 train_loss=4.8229 lr=4.739336e-05
+ epoch=001 step=000550 train_loss=4.7634 lr=5.213270e-05
+ epoch=001 step=000600 train_loss=4.7001 lr=5.687204e-05
+ epoch=001 step=000650 train_loss=4.6303 lr=6.161137e-05
+ epoch=001 step=000700 train_loss=4.5589 lr=6.635071e-05
+ epoch=001 step=000750 train_loss=4.4806 lr=7.109005e-05
+ epoch=001 step=000800 train_loss=4.4025 lr=7.582938e-05
+ epoch=001 step=000850 train_loss=4.3220 lr=8.056872e-05
+ epoch=001 step=000900 train_loss=4.2364 lr=8.530806e-05
+ epoch=001 step=000950 train_loss=4.1570 lr=9.004739e-05
+ Step checkpoint saved: epoch=001 step=001000 loss~4.0672
+ epoch=001 step=001000 train_loss=4.0672 lr=9.478673e-05
+ epoch=001 step=001050 train_loss=3.9799 lr=9.952607e-05
+ epoch=001 step=001100 train_loss=3.8986 lr=1.042654e-04
+ epoch=001 step=001150 train_loss=3.8111 lr=1.090047e-04
+ epoch=001 step=001200 train_loss=3.7314 lr=1.137441e-04
+ epoch=001 step=001250 train_loss=3.6563 lr=1.184834e-04
+ epoch=001 step=001300 train_loss=3.5761 lr=1.232227e-04
+ epoch=001 step=001350 train_loss=3.5077 lr=1.279621e-04
+ epoch=001 step=001400 train_loss=3.4228 lr=1.327014e-04
+ epoch=001 step=001450 train_loss=3.3561 lr=1.374408e-04
+ epoch=001 step=001500 train_loss=3.2874 lr=1.421801e-04
+ epoch=001 step=001550 train_loss=3.2238 lr=1.469194e-04
+ epoch=001 step=001600 train_loss=3.1665 lr=1.516588e-04
+ epoch=001 step=001650 train_loss=3.0983 lr=1.563981e-04
+ epoch=001 step=001700 train_loss=3.0406 lr=1.611374e-04
+ epoch=001 step=001750 train_loss=2.9943 lr=1.658768e-04
+ epoch=001 step=001800 train_loss=2.9474 lr=1.706161e-04
+ epoch=001 step=001850 train_loss=2.8880 lr=1.753555e-04
+ epoch=001 step=001900 train_loss=2.8516 lr=1.800948e-04
+ epoch=001 step=001950 train_loss=2.8169 lr=1.848341e-04
+ Step checkpoint saved: epoch=001 step=002000 loss~2.7854
+ epoch=001 step=002000 train_loss=2.7854 lr=1.895735e-04
+ epoch=001 step=002050 train_loss=2.7574 lr=1.943128e-04
+ epoch=001 step=002100 train_loss=2.7148 lr=1.990521e-04
+ epoch=001 step=002150 train_loss=2.6977 lr=1.999997e-04
+ epoch=001 step=002200 train_loss=2.6649 lr=1.999983e-04
+ epoch=001 step=002250 train_loss=2.6285 lr=1.999960e-04
+ epoch=001 step=002300 train_loss=2.6148 lr=1.999925e-04
+ epoch=001 step=002350 train_loss=2.5961 lr=1.999881e-04
+ epoch=001 step=002400 train_loss=2.5664 lr=1.999826e-04
+ Epoch 001 | train_loss=3.7736 | val_loss=2.5591 | ppl=12.92
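The `ppl` field in the epoch summary above is consistent with perplexity computed as the exponential of `val_loss`, assuming the loss is a mean cross-entropy in nats. The helper below is an illustrative sketch, not code from the training script.

```python
import math


def perplexity(val_loss: float) -> float:
    """Perplexity as exp of a mean cross-entropy loss measured in nats."""
    return math.exp(val_loss)


# val_loss taken from the Epoch 001 summary line.
print(f"{perplexity(2.5591):.2f}")
```

The same relation holds for the Epoch 003 summary further down (`val_loss=2.1674`, `ppl=8.74`).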
+ /usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
+ return func(*args, **kwargs)
+ epoch=002 step=002450 train_loss=2.5690 lr=1.999761e-04
+ epoch=002 step=002500 train_loss=2.5478 lr=1.999686e-04
+ epoch=002 step=002550 train_loss=2.5284 lr=1.999600e-04
+ epoch=002 step=002600 train_loss=2.5071 lr=1.999505e-04
+ epoch=002 step=002650 train_loss=2.4980 lr=1.999398e-04
+ epoch=002 step=002700 train_loss=2.4797 lr=1.999282e-04
+ epoch=002 step=002750 train_loss=2.4798 lr=1.999155e-04
+ epoch=002 step=002800 train_loss=2.4632 lr=1.999018e-04
+ epoch=002 step=002850 train_loss=2.4440 lr=1.998870e-04
+ epoch=002 step=002900 train_loss=2.4474 lr=1.998712e-04
+ epoch=002 step=002950 train_loss=2.4423 lr=1.998544e-04
+ Step checkpoint saved: epoch=002 step=003000 loss~2.4265
+ # NOTE: This model was trained in two separate sessions. This is where session two began.
+ epoch=003 step=003050 train_loss=2.4215 lr=1.998177e-04
+ epoch=003 step=003100 train_loss=2.4032 lr=1.997978e-04
+ epoch=003 step=003150 train_loss=2.4000 lr=1.997769e-04
+ epoch=003 step=003200 train_loss=2.3878 lr=1.997549e-04
+ epoch=003 step=003250 train_loss=2.3915 lr=1.997319e-04
+ epoch=003 step=003300 train_loss=2.3817 lr=1.997079e-04
+ epoch=003 step=003350 train_loss=2.3815 lr=1.996829e-04
+ epoch=003 step=003400 train_loss=2.3769 lr=1.996568e-04
+ epoch=003 step=003450 train_loss=2.3592 lr=1.996297e-04
+ epoch=003 step=003500 train_loss=2.3582 lr=1.996016e-04
+ epoch=003 step=003550 train_loss=2.3433 lr=1.995724e-04
+ epoch=003 step=003600 train_loss=2.3343 lr=1.995422e-04
+ epoch=003 step=003650 train_loss=2.3458 lr=1.995110e-04
+ epoch=003 step=003700 train_loss=2.3448 lr=1.994788e-04
+ epoch=003 step=003750 train_loss=2.3167 lr=1.994455e-04
+ epoch=003 step=003800 train_loss=2.3137 lr=1.994112e-04
+ epoch=003 step=003850 train_loss=2.3265 lr=1.993759e-04
+ epoch=003 step=003900 train_loss=2.3250 lr=1.993396e-04
+ epoch=003 step=003950 train_loss=2.3083 lr=1.993022e-04
+ Step checkpoint saved: epoch=003 step=004000 loss~2.3077
+ epoch=003 step=004000 train_loss=2.3077 lr=1.992638e-04
+ epoch=003 step=004050 train_loss=2.3036 lr=1.992244e-04
+ epoch=003 step=004100 train_loss=2.2895 lr=1.991840e-04
+ epoch=003 step=004150 train_loss=2.2808 lr=1.991425e-04
+ epoch=003 step=004200 train_loss=2.2893 lr=1.991000e-04
+ epoch=003 step=004250 train_loss=2.2891 lr=1.990565e-04
+ epoch=003 step=004300 train_loss=2.2760 lr=1.990120e-04
+ epoch=003 step=004350 train_loss=2.2777 lr=1.989665e-04
+ epoch=003 step=004400 train_loss=2.2768 lr=1.989199e-04
+ epoch=003 step=004450 train_loss=2.2673 lr=1.988723e-04
+ epoch=003 step=004500 train_loss=2.2637 lr=1.988237e-04
+ epoch=003 step=004550 train_loss=2.2659 lr=1.987741e-04
+ epoch=003 step=004600 train_loss=2.2515 lr=1.987235e-04
+ epoch=003 step=004650 train_loss=2.2331 lr=1.986718e-04
+ epoch=003 step=004700 train_loss=2.2511 lr=1.986191e-04
+ epoch=003 step=004750 train_loss=2.2188 lr=1.985655e-04
+ epoch=003 step=004800 train_loss=2.2482 lr=1.985108e-04
+ epoch=003 step=004850 train_loss=2.2448 lr=1.984550e-04
+ epoch=003 step=004900 train_loss=2.2246 lr=1.983983e-04
+ epoch=003 step=004950 train_loss=2.2188 lr=1.983406e-04
+ Step checkpoint saved: epoch=003 step=005000 loss~2.2113
+ epoch=003 step=005000 train_loss=2.2113 lr=1.982818e-04
+ epoch=003 step=005050 train_loss=2.2207 lr=1.982220e-04
+ epoch=003 step=005100 train_loss=2.2103 lr=1.981613e-04
+ epoch=003 step=005150 train_loss=2.1865 lr=1.980995e-04
+ epoch=003 step=005200 train_loss=2.1997 lr=1.980367e-04
+ epoch=003 step=005250 train_loss=2.2105 lr=1.979729e-04
+ epoch=003 step=005300 train_loss=2.1968 lr=1.979080e-04
+ epoch=003 step=005350 train_loss=2.1893 lr=1.978422e-04
+ epoch=003 step=005400 train_loss=2.1870 lr=1.977754e-04
+ Epoch 003 | train_loss=2.2880 | val_loss=2.1674 | ppl=8.74
+ Epoch bundle saved: /kaggle/working/outputs/sub100m_e_40m_100k/kaggle_model_exports/epoch_003
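For plotting or comparing the per-step entries in this log, the `epoch=… step=… train_loss=… lr=…` lines can be parsed with a regular expression. The sketch below is an assumption about how one might do that; `STEP_LINE` and `parse_step_line` are hypothetical names and are not part of the training code.

```python
import re

# Matches per-step log lines such as:
#   + epoch=003 step=005400 train_loss=2.1870 lr=1.977754e-04
STEP_LINE = re.compile(
    r"epoch=(?P<epoch>\d+)\s+step=(?P<step>\d+)\s+"
    r"train_loss=(?P<train_loss>[\d.]+)\s+lr=(?P<lr>[\d.eE+-]+)"
)


def parse_step_line(line: str):
    """Return a dict for a per-step line, or None for any other line."""
    match = STEP_LINE.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match["epoch"]),
        "step": int(match["step"]),
        "train_loss": float(match["train_loss"]),
        "lr": float(match["lr"]),
    }


record = parse_step_line("+ epoch=003 step=005400 train_loss=2.1870 lr=1.977754e-04")
print(record)
```

Checkpoint and epoch-summary lines do not contain the `train_loss=… lr=…` pair after `step=`, so they return `None` and can be filtered out.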
+ /usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
+ return func(*args, **kwargs)
+ epoch=004 step=005450 train_loss=2.1949 lr=1.977075e-04
+ epoch=004 step=005500 train_loss=2.1765 lr=1.976387e-04
+ epoch=004 step=005550 train_loss=2.1790 lr=1.975688e-04
+ epoch=004 step=005600 train_loss=2.1691 lr=1.974980e-04
+ epoch=004 step=005650 train_loss=2.1764 lr=1.974261e-04
+ epoch=004 step=005700 train_loss=2.1610 lr=1.973533e-04
+ epoch=004 step=005750 train_loss=2.1606 lr=1.972794e-04
+ epoch=004 step=005800 train_loss=2.1540 lr=1.972045e-04
+ epoch=004 step=005850 train_loss=2.1448 lr=1.971287e-04
+ epoch=004 step=005900 train_loss=2.1464 lr=1.970518e-04
+ epoch=004 step=005950 train_loss=2.1550 lr=1.969739e-04
+ Step checkpoint saved: epoch=004 step=006000 loss~2.1383
+ epoch=004 step=006000 train_loss=2.1383 lr=1.968951e-04
+ epoch=004 step=006050 train_loss=2.1607 lr=1.968152e-04
+ epoch=004 step=006100 train_loss=2.1325 lr=1.967344e-04
+ epoch=004 step=006150 train_loss=2.1383 lr=1.966525e-04
+ epoch=004 step=006200 train_loss=2.1425 lr=1.965697e-04
+ epoch=004 step=006250 train_loss=2.1288 lr=1.964859e-04
+ epoch=004 step=006300 train_loss=2.1408 lr=1.964011e-04
+ epoch=004 step=006350 train_loss=2.1292 lr=1.963152e-04
+ epoch=004 step=006400 train_loss=2.1468 lr=1.962284e-04
+ epoch=004 step=006450 train_loss=2.1370 lr=1.961406e-04
+ epoch=004 step=006500 train_loss=2.1126 lr=1.960519e-04
+ epoch=004 step=006550 train_loss=2.1325 lr=1.959621e-04
+ epoch=004 step=006600 train_loss=2.1084 lr=1.958714e-04
+ epoch=004 step=006650 train_loss=2.1124 lr=1.957796e-04
+ epoch=004 step=006700 train_loss=2.1193 lr=1.956869e-04
+ epoch=004 step=006750 train_loss=2.1147 lr=1.955932e-04
+ epoch=004 step=006800 train_loss=2.1131 lr=1.954985e-04
+ epoch=004 step=006850 train_loss=2.0983 lr=1.954029e-04
+ epoch=004 step=006900 train_loss=2.1052 lr=1.953062e-04
+ epoch=004 step=006950 train_loss=2.1126 lr=1.952086e-04
+ Step checkpoint saved: epoch=004 step=007000 loss~2.1097
+ epoch=004 step=007000 train_loss=2.1097 lr=1.951100e-04
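The `lr` column in the step lines above is consistent with a linear warmup from 0 to `--learning_rate 0.0002` over `--warmup_steps 2110`, followed by a slow decay. The sketch below reproduces the warmup values exactly as logged. The post-warmup branch is an assumption: it uses a cosine decay toward `learning_rate * min_lr_ratio`, and the log states neither the decay formula nor the total step count, so `total_steps` is left as a caller-supplied parameter. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(step, base_lr=2e-4, warmup_steps=2110, total_steps=None, min_lr_ratio=0.1):
    """Learning rate at a given optimizer step.

    Warmup: linear ramp from 0 to base_lr (matches the logged values).
    After warmup: cosine decay to base_lr * min_lr_ratio (assumed shape).
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    if total_steps is None:
        raise ValueError("total_steps is required after warmup; the log does not state it")
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    min_lr = base_lr * min_lr_ratio
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Compare against logged warmup entries at steps 50, 1000, and 2100.
for step in (50, 1000, 2100):
    print(step, f"{lr_at(step):.6e}")
```

Formatting with `.6e` matches the notation used in the log, which makes a line-by-line comparison with the warmup entries straightforward.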