Twinkles01 committed on Mar 30

Commit

891954b

verified

1 Parent(s): 506f0ec

Upload Main_200022

Files changed (21)

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_005000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_010000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_015000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_020000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_025000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_030000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_035000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_040000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_045000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_050000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_055000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_060000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_065000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_070000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_075000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_080000.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/cfg.txt +55 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/tensorboard/events.out.tfevents.1769289622.brev-5x9knwe1p.3461335.0 +3 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/timemoe_base.py +126 -0
Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/training_log_20260124212010.log +153 -0

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_005000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cff4250240b4888f338ab99aac75dd99684dcc0055b6a0378b7aa382277e1ce4
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_010000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0bd2f1d2f6f1b1d7a1190f19e47105e18b7d085d4b299eb7ad29620bc5561b73
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_015000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:62024c2a7c34b738ab7699a49e08a89f62acb2b470bd7b1f0c5cbe21e483cf4e
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_020000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:778deda4d2034f4bdfca4d0a88d419e01b4b69cedf06354253cb28c4c75aaa04
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_025000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cacdc0add680951388ffe14cee04eee5195f3260c1654d0ecb3d1a08d0cc237e
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_030000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:55862dd8c50eaf3998e5abd54f36df04b301085a0ed087e351bdffd2679f14eb
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_035000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:71138a75fdc9afa7be3514cb6d66b87c97ac6b237fbe7857c4547bdc61eea2bd
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_040000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3062451c9f846aa0dbd38a20882e3e620a2adc4deff6eeed6d83a10d18ade787
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_045000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e72a92174d6b0daba0e99b643ec77ed4407ae57b6edfcae3ceef948a8e067408
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_050000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:61b326e52409da619ca42c6a366265f2bb446b49aff4ab030b20259bf49ea8c8
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_055000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1dd1d92d104651fab5a9c8f42219315e56dd71be2e984c3369d64e1abaa2c27d
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_060000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a5876df9482d067d00bc0b3377b12ac9c553f4cee3472f5a57ec26c5b8fe22ca
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_065000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3bedcb8fc69e12fa7b9a812929e49d568c3444e898a6b539804a6401516deae6
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_070000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9d030ee72bb8cd1987c8a788fef6ddfdcdae5631f5bc6bd3e076118ea940b832
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_075000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f6f5bb02f7c39a79a4a4e602819dc7d59fec142e2219f033d6694bd1cac08381
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_080000.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:257d4d30ba616ece27bb80aff4ee6a5f010a972a0d11ee15bc3523c5a3cebbf8
+size 151982040

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7627660da6212df6d7e8ba52627a5b8bcca480206fa37642afdff1a39e419f52
+size 151985051
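
Each `.pt` entry above is a Git LFS pointer rather than the checkpoint itself: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes. A minimal sketch of reading such a pointer, assuming the pointer text is already available as a string. The helper name `parse_lfs_pointer` is hypothetical and not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer text copied from the TimeMoE4_005000.pt entry above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:cff4250240b4888f338ab99aac75dd99684dcc0055b6a0378b7aa382277e1ce4
size 151982040
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"])
```

After downloading the real file, its SHA-256 digest and byte size can be compared against these fields to confirm the download is intact.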

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/cfg.txt ADDED

	@@ -0,0 +1,55 @@

+DESCRIPTION: TimeMoE Base
+DEVICE: gpu
+DEVICE_NUM: 3
+RUNNER: <class 'baselines.TimeMoE4.runner.runner.TimeMoERunner'>
+MODEL:
+  NAME: TimeMoE4
+  ARCH: <class 'baselines.TimeMoE4.arch.timemoe.TimeMoE4'>
+  PARAM:
+    model_id: baselines/TimeMoE/ckpt/TimeMoE-50M
+    from_pretrained: False
+    context_length: 4079
+    trust_remote_code: True
+  DTYPE: bfloat16
+METRICS:
+  FUNCS:
+TRAIN:
+  COMPILE_MODEL: True
+  NUM_ITERATIONS: 200022
+  CKPT_SAVE_DIR: checkpoints/TimeMoE4/Main_200022
+  CKPT_SAVE_STRATEGY: 5000
+  LOSS: fake_loss
+  OPTIM:
+    TYPE: AdamW
+    PARAM:
+      lr: 0.001
+      betas: (0.9, 0.95)
+      fused: True
+  LR_SCHEDULER:
+    TYPE: CosineWarmup
+    PARAM:
+      num_warmup_steps: 10000
+      num_training_steps: 200022
+  CLIP_GRAD_PARAM:
+    max_norm: 1.0
+  DATA:
+    BATCH_SIZE: 85
+    SHUFFLE: True
+    PIN_MEMORY: True
+    PREFETCH: True
+  GRAD_ACCUMULATION_STEPS: 1
+VAL:
+  INTERVAL: 5000
+  DATA:
+    BATCH_SIZE: 170
+EVAL:
+  USE_GPU: True
+DATASET:
+  NAME: Main
+  TYPE: <class 'baselines.TimeMoE4.data.mix_dataset_v2.MixedSourceDataset_v2'>
+  PARAM:
+    num_valid_samples: 1000
+INFERENCE:
+  GENERATION_PARAMS:
+    normalize: True
+MD5: bebb26bf57d82ed6f0e5b7b943601e23
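
The effective batch size implied by this configuration follows from the device count, the per-device batch size, and the gradient accumulation steps. A small sketch of that arithmetic, assuming `TRAIN.DATA.BATCH_SIZE` is a per-device value (the training log's "batch size per GPUs: 85" line suggests this). The variable names are illustrative only.

```python
# Values copied from cfg.txt; the variable names are illustrative.
device_num = 3             # DEVICE_NUM
per_device_batch = 85      # TRAIN.DATA.BATCH_SIZE (assumed to be per device)
grad_accum_steps = 1       # TRAIN.GRAD_ACCUMULATION_STEPS
num_iterations = 200_022   # TRAIN.NUM_ITERATIONS

# Samples consumed per optimizer step across all devices.
effective_batch = device_num * per_device_batch * grad_accum_steps

# Samples drawn if the run completes every configured iteration.
samples_seen = effective_batch * num_iterations

print(effective_batch)  # 255
print(samples_seen)     # 51005610
```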

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/tensorboard/events.out.tfevents.1769289622.brev-5x9knwe1p.3461335.0 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:812262536992743c76a98068c50e963e61bbe4b87d39889724df1c55c2e7e7b7
+size 16990784

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/timemoe_base.py ADDED

	@@ -0,0 +1,126 @@

+# Sampling probability changes
+import os
+import sys
+from easydict import EasyDict
+sys.path.append(os.path.abspath(__file__ + '/../../..'))
+from ..arch import TimeMoE4
+from ..data import MixedSourceDataset_v2
+from ..runner import TimeMoERunner
+from ..loss import fake_loss
+############################## Hot Parameters ##############################
+# Dataset & Metrics configuration
+# Model architecture and parameters
+pretrained = False  # Whether to use a pretrained model
+MODEL_ARCH = TimeMoE4
+MODEL_PARAM = {
+    'model_id': "baselines/TimeMoE/ckpt/TimeMoE-50M",
+    'from_pretrained': pretrained,
+    'context_length': 4079,
+    'trust_remote_code': True,
+}
+DATA_NAME = "Main"
+# N = 20_000_000
+# batch size = 16*8
+# 20_000_000 / 16 / 8 = 156250 iterations
+# 20_000_000 * 4096 / 16 / 8 / 4096 = 156_250
+NUM_ITERATIONS = 200_022  # Total iterations; 20_000_000 * 4096 / 16 / 4 / 4096 = 312,500
+VAL_ITERATION_INTERVAL = 5_000  # Run validation every VAL_ITERATION_INTERVAL iterations
+############################## General Configuration ##############################
+CFG = EasyDict()
+# General settings
+CFG.DESCRIPTION = 'TimeMoE Base'
+CFG.DEVICE = 'gpu'
+CFG.DEVICE_NUM = 3
+# Runner
+CFG.RUNNER = TimeMoERunner
+############################## Model Configuration ################################
+CFG.MODEL = EasyDict()
+CFG.MODEL.NAME = MODEL_ARCH.__name__
+CFG.MODEL.ARCH = MODEL_ARCH
+CFG.MODEL.PARAM = MODEL_PARAM
+CFG.MODEL.DTYPE = 'bfloat16'
+# CFG.MODEL.DTYPE = 'float32'
+############################## Metrics Configuration ##############################
+CFG.METRICS = EasyDict()
+# Metrics settings
+CFG.METRICS.FUNCS = EasyDict({})
+############################## Training Configuration ##############################
+CFG.TRAIN = EasyDict()
+CFG.TRAIN.COMPILE_MODEL = True
+CFG.TRAIN.NUM_ITERATIONS = NUM_ITERATIONS
+CFG.TRAIN.CKPT_SAVE_DIR = os.path.join(
+    'checkpoints',
+    MODEL_ARCH.__name__,
+    '_'.join([DATA_NAME, str(CFG.TRAIN.NUM_ITERATIONS)])
+)
+CFG.TRAIN.CKPT_SAVE_STRATEGY = VAL_ITERATION_INTERVAL * 1  # Save strategy: save a checkpoint every VAL_ITERATION_INTERVAL * 1 iterations
+CFG.TRAIN.LOSS = fake_loss
+# Optimizer settings
+CFG.TRAIN.OPTIM = EasyDict()
+CFG.TRAIN.OPTIM.TYPE = "AdamW"
+CFG.TRAIN.OPTIM.PARAM = {
+    "lr": 1e-3,
+    "betas": (0.9, 0.95),
+    # "betas": (0.9, 0.98),
+    "fused": True,
+    # "weight_decay": 1e-1,
+}
+# Learning rate scheduler settings
+CFG.TRAIN.LR_SCHEDULER = EasyDict()
+CFG.TRAIN.LR_SCHEDULER.TYPE = "CosineWarmup"
+CFG.TRAIN.LR_SCHEDULER.PARAM = {
+    'num_warmup_steps': 10_000, # 10k
+    'num_training_steps': NUM_ITERATIONS,
+}
+CFG.TRAIN.CLIP_GRAD_PARAM = {
+    'max_norm': 1.0
+}
+# Train data loader settings
+CFG.TRAIN.DATA = EasyDict()
+CFG.TRAIN.DATA.BATCH_SIZE = 85  # per-GPU batch size
+CFG.TRAIN.DATA.SHUFFLE = True # has to be False
+CFG.TRAIN.DATA.PIN_MEMORY = True
+CFG.TRAIN.DATA.PREFETCH = True
+CFG.TRAIN.GRAD_ACCUMULATION_STEPS = 1
+# CFG.TRAIN.DATA.NUM_WORKERS = 4
+############################## Validation Configuration ##############################
+CFG.VAL = EasyDict()
+CFG.VAL.INTERVAL = VAL_ITERATION_INTERVAL
+CFG.VAL.DATA = EasyDict()
+CFG.VAL.DATA.BATCH_SIZE = 170  # 32  /  8
+############################## Evaluation Configuration ##############################
+CFG.EVAL = EasyDict()
+# Evaluation parameters
+CFG.EVAL.USE_GPU = True # Whether to use GPU for evaluation. Default: True
+############################## Dataset Configuration ##############################
+CFG.DATASET = EasyDict()
+# Dataset settings
+CFG.DATASET.NAME = DATA_NAME
+CFG.DATASET.TYPE = MixedSourceDataset_v2
+CFG.DATASET.PARAM = EasyDict({
+    'num_valid_samples': 1000
+})
+############################## Inference Configuration ##############################
+CFG.INFERENCE = EasyDict()
+CFG.INFERENCE.GENERATION_PARAMS = EasyDict({
+    'normalize': not pretrained
+})
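
The scheduler settings in this file describe a warmup of 10,000 steps inside a 200,022-step schedule with a peak learning rate of 1e-3. The sketch below assumes `CosineWarmup` uses the common linear-warmup, half-cosine-decay form (as in Hugging Face's `get_cosine_schedule_with_warmup`). That is an assumption; the actual `basicts` implementation is not shown in this commit. The function names are hypothetical.

```python
import math

BASE_LR = 1e-3          # CFG.TRAIN.OPTIM.PARAM["lr"]
WARMUP_STEPS = 10_000   # num_warmup_steps
TOTAL_STEPS = 200_022   # num_training_steps


def lr_at(step: int) -> float:
    """Learning rate at a step under the assumed schedule (hypothetical)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to BASE_LR.
        return BASE_LR * step / WARMUP_STEPS
    # Half-cosine decay from BASE_LR down to 0 at TOTAL_STEPS.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


def window_mean_lr(start: int, end: int) -> float:
    """Mean learning rate over steps start .. end - 1."""
    return sum(lr_at(s) for s in range(start, end)) / (end - start)


# Mean learning rate over each of the first three 5,000-step windows.
for start in (0, 5_000, 10_000):
    print(start + 5_000, f"{window_mean_lr(start, start + 5_000):.2e}")
```

Under this assumption the window means come out near 2.5e-4, 7.5e-4, and 9.99e-4, which agrees with the `train/lr` values reported at iterations 5000, 10000, and 15000 in the training log. This suggests, but does not prove, that the logged values are averages over each logging interval rather than instantaneous rates.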

Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/training_log_20260124212010.log ADDED

	@@ -0,0 +1,153 @@

+2026-01-24 21:20:10,513 - easytorch-training - INFO - Initializing training.
+2026-01-24 21:20:10,515 - easytorch-training - INFO - Set clip grad, param: {'max_norm': 1.0}
+2026-01-24 21:20:10,515 - easytorch-training - INFO - Building training data loader.
+2026-01-24 21:20:22,467 - easytorch-training - INFO - MixedSourceDataset initialized for 'train' mode.
+2026-01-24 21:20:22,467 - easytorch-training - INFO -   - real: 3201174 samples
+2026-01-24 21:20:22,467 - easytorch-training - INFO -   - synth: 2000000 samples
+2026-01-24 21:20:22,468 - easytorch-training - INFO - Train dataset length: 3201174
+2026-01-24 21:20:22,470 - easytorch-training - INFO - Set optim: AdamW (
+Parameter Group 0
+    amsgrad: False
+    betas: (0.9, 0.95)
+    capturable: False
+    differentiable: False
+    eps: 1e-08
+    foreach: None
+    fused: True
+    lr: 0.001
+    maximize: False
+    weight_decay: 0.01
+)
+2026-01-24 21:20:22,470 - easytorch-training - INFO - Set lr_scheduler: <basicts.runners.optim.lr_schedulers.CosineWarmup object at 0x7aef5c7f08d0>
+2026-01-24 21:20:22,472 - easytorch-training - INFO - Initializing validation.
+2026-01-24 21:20:22,472 - easytorch-training - INFO - Building val data loader.
+2026-01-24 21:20:22,828 - easytorch-training - INFO - Worker 0 initialized for cauker_univariate.
+2026-01-24 21:20:51,581 - easytorch-training - INFO - MixedSourceDataset initialized for 'valid' mode.
+2026-01-24 21:20:51,581 - easytorch-training - INFO -   - real: 1000 samples
+2026-01-24 21:20:51,581 - easytorch-training - INFO - Valid dataset length: 1000
+2026-01-24 21:20:51,582 - easytorch-training - INFO - Number of parameters: 12653568
+2026-01-24 21:20:51,582 - easytorch-training - INFO - Training with 3 GPUs, batch size per GPUs: 85, grad_accumulation_steps: 1
+2026-01-24 21:20:51,582 - easytorch-training - INFO - Effective batch size: 255
+2026-01-24 22:19:53,382 - easytorch-training - INFO - Iteration 5000 / 200022
+2026-01-24 22:19:53,873 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.71 (s), train/lr: 2.50e-04, train/loss: 4.7339, train/grad_norm: 10.7732, train/amp_scale: 1.0000]
+2026-01-24 22:19:53,873 - easytorch-training - INFO - Start validation.
+2026-01-24 22:20:06,117 - easytorch-training - INFO - Result <val>: [val/time: 12.07 (s), val/loss: 3.3541]
+2026-01-24 22:20:06,238 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-24 22:20:06,355 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_005000.pt saved
+2026-01-24 22:20:06,356 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 12:50:58
+2026-01-24 23:20:46,773 - easytorch-training - INFO - Iteration 10000 / 200022
+2026-01-24 23:20:47,261 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.73 (s), train/lr: 7.50e-04, train/loss: 2.9655, train/grad_norm: 11.4635, train/amp_scale: 1.0000]
+2026-01-24 23:20:47,262 - easytorch-training - INFO - Start validation.
+2026-01-24 23:20:54,273 - easytorch-training - INFO - Result <val>: [val/time: 6.83 (s), val/loss: 3.2979]
+2026-01-24 23:20:54,396 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-24 23:20:54,502 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_010000.pt saved
+2026-01-24 23:20:54,503 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:22:05
+2026-01-25 00:21:30,485 - easytorch-training - INFO - Iteration 15000 / 200022
+2026-01-25 00:21:30,972 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.73 (s), train/lr: 9.99e-04, train/loss: 2.7612, train/grad_norm: 9.0248, train/amp_scale: 1.0000]
+2026-01-25 00:21:30,972 - easytorch-training - INFO - Start validation.
+2026-01-25 00:21:37,654 - easytorch-training - INFO - Result <val>: [val/time: 6.50 (s), val/loss: 3.2530]
+2026-01-25 00:21:37,772 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-25 00:21:37,881 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_015000.pt saved
+2026-01-25 00:21:37,883 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:31:24
+2026-01-25 01:22:17,571 - easytorch-training - INFO - Iteration 20000 / 200022
+2026-01-25 01:22:18,061 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.73 (s), train/lr: 9.96e-04, train/loss: 2.6855, train/grad_norm: 8.4333, train/amp_scale: 1.0000]
+2026-01-25 01:22:18,061 - easytorch-training - INFO - Start validation.
+2026-01-25 01:22:24,743 - easytorch-training - INFO - Result <val>: [val/time: 6.50 (s), val/loss: 3.2349]
+2026-01-25 01:22:24,872 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-25 01:22:24,992 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_020000.pt saved
+2026-01-25 01:22:24,993 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:36:41
+2026-01-25 02:22:28,599 - easytorch-training - INFO - Iteration 25000 / 200022
+2026-01-25 02:22:29,088 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.72 (s), train/lr: 9.89e-04, train/loss: 2.6241, train/grad_norm: 8.2434, train/amp_scale: 1.0000]
+2026-01-25 02:22:29,089 - easytorch-training - INFO - Start validation.
+2026-01-25 02:22:35,766 - easytorch-training - INFO - Result <val>: [val/time: 6.50 (s), val/loss: 3.2258]
+2026-01-25 02:22:35,898 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-25 02:22:36,020 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_025000.pt saved
+2026-01-25 02:22:36,021 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:35:03
+2026-01-25 03:22:45,874 - easytorch-training - INFO - Iteration 30000 / 200022
+2026-01-25 03:22:46,364 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.72 (s), train/lr: 9.79e-04, train/loss: 2.5987, train/grad_norm: 7.8973, train/amp_scale: 1.0000]
+2026-01-25 03:22:46,364 - easytorch-training - INFO - Start validation.
+2026-01-25 03:22:53,079 - easytorch-training - INFO - Result <val>: [val/time: 6.54 (s), val/loss: 3.2204]
+2026-01-25 03:22:53,206 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-25 03:22:53,324 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_030000.pt saved
+2026-01-25 03:22:53,325 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:34:39
+2026-01-25 04:22:46,580 - easytorch-training - INFO - Iteration 35000 / 200022
+2026-01-25 04:22:47,068 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.72 (s), train/lr: 9.66e-04, train/loss: 2.5758, train/grad_norm: 7.7095, train/amp_scale: 1.0000]
+2026-01-25 04:22:47,069 - easytorch-training - INFO - Start validation.
+2026-01-25 04:22:53,794 - easytorch-training - INFO - Result <val>: [val/time: 6.55 (s), val/loss: 3.2095]
+2026-01-25 04:22:53,920 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-25 04:22:54,041 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_035000.pt saved
+2026-01-25 04:22:54,041 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:32:47
+2026-01-25 05:27:02,297 - easytorch-training - INFO - Iteration 40000 / 200022
+2026-01-25 05:27:02,791 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.77 (s), train/lr: 9.49e-04, train/loss: 2.5600, train/grad_norm: 7.5436, train/amp_scale: 1.0000]
+2026-01-25 05:27:02,792 - easytorch-training - INFO - Start validation.
+2026-01-25 05:27:09,513 - easytorch-training - INFO - Result <val>: [val/time: 6.54 (s), val/loss: 3.2000]
+2026-01-25 05:27:09,640 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-25 05:27:09,761 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_040000.pt saved
+2026-01-25 05:27:09,761 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:52:38
+2026-01-25 06:27:52,101 - easytorch-training - INFO - Iteration 45000 / 200022
+2026-01-25 06:27:52,586 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.73 (s), train/lr: 9.29e-04, train/loss: 2.5417, train/grad_norm: 7.4302, train/amp_scale: 1.0000]
+2026-01-25 06:27:52,587 - easytorch-training - INFO - Start validation.
+2026-01-25 06:27:59,259 - easytorch-training - INFO - Result <val>: [val/time: 6.50 (s), val/loss: 3.2017]
+2026-01-25 06:27:59,368 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_045000.pt saved
+2026-01-25 06:27:59,369 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:52:48
+2026-01-25 07:27:24,716 - easytorch-training - INFO - Iteration 50000 / 200022
+2026-01-25 07:27:25,203 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.71 (s), train/lr: 9.07e-04, train/loss: 2.5299, train/grad_norm: 7.4752, train/amp_scale: 1.0000]
+2026-01-25 07:27:25,203 - easytorch-training - INFO - Start validation.
+2026-01-25 07:27:31,874 - easytorch-training - INFO - Result <val>: [val/time: 6.49 (s), val/loss: 3.1984]
+2026-01-25 07:27:31,990 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-25 07:27:32,097 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_050000.pt saved
+2026-01-25 07:27:32,098 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:47:49
+2026-01-25 08:26:48,978 - easytorch-training - INFO - Iteration 55000 / 200022
+2026-01-25 08:26:49,467 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.71 (s), train/lr: 8.81e-04, train/loss: 2.5206, train/grad_norm: 7.3313, train/amp_scale: 1.0000]
+2026-01-25 08:26:49,468 - easytorch-training - INFO - Start validation.
+2026-01-25 08:26:56,155 - easytorch-training - INFO - Result <val>: [val/time: 6.51 (s), val/loss: 3.1930]
+2026-01-25 08:26:56,272 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-25 08:26:56,379 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_055000.pt saved
+2026-01-25 08:26:56,380 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:43:14
+2026-01-25 09:26:45,895 - easytorch-training - INFO - Iteration 60000 / 200022
+2026-01-25 09:26:46,381 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.72 (s), train/lr: 8.53e-04, train/loss: 2.5156, train/grad_norm: 7.3280, train/amp_scale: 1.0000]
+2026-01-25 09:26:46,381 - easytorch-training - INFO - Start validation.
+2026-01-25 09:26:55,801 - easytorch-training - INFO - Result <val>: [val/time: 9.24 (s), val/loss: 3.2032]
+2026-01-25 09:26:55,921 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_060000.pt saved
+2026-01-25 09:26:55,922 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:41:22
+2026-01-25 10:26:01,222 - easytorch-training - INFO - Iteration 65000 / 200022
+2026-01-25 10:26:01,711 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.71 (s), train/lr: 8.23e-04, train/loss: 2.5085, train/grad_norm: 7.0677, train/amp_scale: 1.0000]
+2026-01-25 10:26:01,711 - easytorch-training - INFO - Start validation.
+2026-01-25 10:26:08,445 - easytorch-training - INFO - Result <val>: [val/time: 6.56 (s), val/loss: 3.1867]
+2026-01-25 10:26:08,571 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-25 10:26:08,681 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_065000.pt saved
+2026-01-25 10:26:08,682 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:37:23
+2026-01-25 11:25:19,409 - easytorch-training - INFO - Iteration 70000 / 200022
+2026-01-25 11:25:31,825 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.71 (s), train/lr: 7.91e-04, train/loss: 2.4918, train/grad_norm: 7.1327, train/amp_scale: 1.0000]
+2026-01-25 11:25:31,826 - easytorch-training - INFO - Start validation.
+2026-01-25 11:25:38,502 - easytorch-training - INFO - Result <val>: [val/time: 6.50 (s), val/loss: 3.1853]
+2026-01-25 11:25:38,630 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_best_val_loss.pt saved
+2026-01-25 11:25:38,745 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_070000.pt saved
+2026-01-25 11:25:38,745 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:34:47
+2026-01-25 12:25:25,615 - easytorch-training - INFO - Iteration 75000 / 200022
+2026-01-25 12:25:26,101 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.72 (s), train/lr: 7.56e-04, train/loss: 2.4988, train/grad_norm: 7.1837, train/amp_scale: 1.0000]
+2026-01-25 12:25:26,102 - easytorch-training - INFO - Start validation.
+2026-01-25 12:25:32,812 - easytorch-training - INFO - Result <val>: [val/time: 6.53 (s), val/loss: 3.2090]
+2026-01-25 12:25:32,921 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_075000.pt saved
+2026-01-25 12:25:32,921 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:33:37
+2026-01-25 13:24:03,457 - easytorch-training - INFO - Iteration 80000 / 200022
+2026-01-25 13:24:03,943 - easytorch-training - INFO - Result <train>: [train/iter_time: 0.70 (s), train/lr: 7.20e-04, train/loss: 2.4857, train/grad_norm: 7.0993, train/amp_scale: 1.0000]
+2026-01-25 13:24:03,944 - easytorch-training - INFO - Start validation.
+2026-01-25 13:24:10,645 - easytorch-training - INFO - Result <val>: [val/time: 6.52 (s), val/loss: 3.1926]
+2026-01-25 13:24:10,767 - easytorch-training - INFO - Checkpoint checkpoints/TimeMoE4/Main_200022/bebb26bf57d82ed6f0e5b7b943601e23/TimeMoE4_080000.pt saved
+2026-01-25 13:24:10,768 - easytorch-training - INFO - The estimated training finish time is 2026-01-26 13:29:25
+2026-01-25 13:47:33,794 - easytorch-training - ERROR - Traceback (most recent call last):
+  File "/home/nvidia/miniconda3/envs/zxx/lib/python3.11/site-packages/easytorch/launcher/launcher.py", line 31, in training_func
+    runner.train(cfg)
+  File "/lp-dev/zhouxx/BasicTS/basicts/runners/base_iteration_runner.py", line 200, in train
+    self.train_iters(iteration=iteration, dataloader=self.train_data_loader)
+  File "/lp-dev/zhouxx/BasicTS/basicts/runners/base_utsf_runner.py", line 281, in train_iters
+    self.backward(loss, accumulating=accumulating)
+  File "/lp-dev/zhouxx/BasicTS/basicts/runners/base_utsf_runner.py", line 337, in backward
+    grad_norm = sum(
+                ^^^^
+  File "/lp-dev/zhouxx/BasicTS/basicts/runners/base_utsf_runner.py", line 338, in <genexpr>
+    param.grad.data.norm(2).item() ** 2 for param in self.model.parameters() if param.grad is not None
+    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+KeyboardInterrupt
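
The log above follows a regular pattern: an `Iteration N / M` line, then a `Result <train>` line, then a `Result <val>` line. A minimal sketch of pulling the loss curve out of such a log and finding the best validation point, assuming only the line formats visible above. The `parse_log` helper and its regular expressions are illustrative and not part of EasyTorch or BasicTS.

```python
import re

ITER_RE = re.compile(r"Iteration (\d+) / (\d+)")
TRAIN_RE = re.compile(r"train/loss: ([0-9.]+)")
VAL_RE = re.compile(r"val/loss: ([0-9.]+)")


def parse_log(lines):
    """Return (iteration, train_loss, val_loss) tuples (hypothetical helper)."""
    records, current = [], {}
    for line in lines:
        if m := ITER_RE.search(line):
            # A new iteration header starts a fresh record.
            current = {"iteration": int(m.group(1))}
        elif m := TRAIN_RE.search(line):
            current["train_loss"] = float(m.group(1))
        elif m := VAL_RE.search(line):
            # The validation result completes the record.
            records.append(
                (current.get("iteration"), current.get("train_loss"), float(m.group(1)))
            )
    return records


# Three iterations copied from the log above (timestamps shortened).
sample = [
    "INFO - Iteration 65000 / 200022",
    "INFO - Result <train>: [train/iter_time: 0.71 (s), train/lr: 8.23e-04, train/loss: 2.5085, train/grad_norm: 7.0677, train/amp_scale: 1.0000]",
    "INFO - Result <val>: [val/time: 6.56 (s), val/loss: 3.1867]",
    "INFO - Iteration 70000 / 200022",
    "INFO - Result <train>: [train/iter_time: 0.71 (s), train/lr: 7.91e-04, train/loss: 2.4918, train/grad_norm: 7.1327, train/amp_scale: 1.0000]",
    "INFO - Result <val>: [val/time: 6.50 (s), val/loss: 3.1853]",
    "INFO - Iteration 75000 / 200022",
    "INFO - Result <train>: [train/iter_time: 0.72 (s), train/lr: 7.56e-04, train/loss: 2.4988, train/grad_norm: 7.1837, train/amp_scale: 1.0000]",
    "INFO - Result <val>: [val/time: 6.53 (s), val/loss: 3.2090]",
]
records = parse_log(sample)
best = min(records, key=lambda r: r[2])  # lowest validation loss
print(best)
```

On these three entries the lowest validation loss is at iteration 70000, which is also the last iteration in the full log where `TimeMoE4_best_val_loss.pt` was saved before the run was interrupted.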