Tiny_Moe_2/logs/output_run_20260201_potlatch_abdal.log
2026-02-01 13:30:15,590 - root - INFO - Run: run_20260201_potlatch_abdal
2026-02-01 13:30:15,591 - root - INFO - Log directory: /root/tiny_moe/training_runs/Tiny_MoE/logs
2026-02-01 13:30:15,591 - root - INFO - Output dir: /root/tiny_moe/training_runs
2026-02-01 13:30:17,277 - jax._src.xla_bridge - INFO - Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2026-02-01 13:30:20,679 - root - INFO - Flax version: 0.11.1
2026-02-01 13:30:20,679 - root - INFO - Optax version: 0.2.6
2026-02-01 13:30:20,679 - root - INFO - Platform: gpu
2026-02-01 13:30:20,679 - root - INFO - Num Devices: 8
2026-02-01 13:30:20,679 - root - INFO - Devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)]
2026-02-01 13:30:21,576 - root - INFO - Model config:
Config(name='Tiny_MoE',
dtype=<class 'jax.numpy.bfloat16'>,
vocab_size=50304,
block_size=2048,
n_layer=30,
n_embed=672,
n_glu_hidden=2048,
n_head=12,
n_kv_head=4,
n_experts=8,
init_stddev=0.02,
expert_load_factor=1.25,
aux_loss_coeff=0.01,
moe_bias=True,
mlp_bias=False,
attention_bias=False,
load_balance_loss_coeff=0.01,
z_loss_coeff=0.0005,
expert_top_k=2,
ln_epsilon=1e-05,
rope_theta=0.0001,
expert_partition_spec=PartitionSpec('devices',),
sdpa_implementation='cudnn',
value_residual_init=0.5,
unet_skip_in_layers=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
unet_skip_out_layers=(15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28,
29))
2026-02-01 13:32:08,466 - root - INFO - Parameter Count: 1,062,182,190
2026-02-01 13:32:08,466 - root - INFO - Sharded / MoE Parameter Count: 992,210,160
2026-02-01 13:32:08,466 - root - INFO - Replicated Parameter Count: 69,972,030
2026-02-01 13:32:09,750 - root - INFO - Weight decay param count: 1,062,140,928
2026-02-01 13:32:09,751 - root - INFO - Training config:
TrainerConfig(num_tokens=100000000000,
num_tokens_per_batch=262144,
mB=128,
T=2048,
max_steps=381469,
max_lr=0.001,
min_lr=0.0001,
max_grad_norm=1.0,
weight_decay=0.1,
adam_b1=0.9,
adam_b2=0.95,
warmup_steps=3814,
print_interval=100,
val=True,
val_interval=5000,
val_batches=50,
checkpoint_model=False,
checkpoint_optimizer=False,
checkpoint_interval=10000)
2026-02-01 13:32:09,751 - root - INFO - Effective batch size per device: 16
2026-02-01 13:32:14,139 - root - INFO - ModdedNanoGPTDataLoader: 1030 shards (train)
2026-02-01 13:32:14,223 - root - INFO - HuggingfaceDataLoader initialized:
------------------------
label: train
shards: 1,030
shard size: 100,000,000
batch size: 128
block size: 2048
device rank: 1
start shard: 0
start pos: 0
------------------------
2026-02-01 13:32:14,224 - root - INFO - ModdedNanoGPTDataLoader: 1 shards (val)
2026-02-01 13:32:14,313 - root - INFO - Starting from step: 0
2026-02-01 13:33:53,196 - root - INFO - 0 | lr: 0.0000 | loss: 13.8432 | logits loss: 13.5000 | load balance loss: 30.1171 | z loss: 146.0000 | avg iter time: 0.00ms | avg tok/sec: 0.00 | tokens processed: 262,144
2026-02-01 13:36:51,924 - root - INFO - 100 | lr: 0.0000 | loss: 8.5947 | logits loss: 8.2500 | load balance loss: 30.2057 | z loss: 38.7500 | avg iter time: 1780.04ms | avg tok/sec: 147,268.27 | tokens processed: 26,476,544
2026-02-01 13:38:21,054 - root - INFO - 200 | lr: 0.0001 | loss: 7.2255 | logits loss: 6.9062 | load balance loss: 30.2648 | z loss: 8.5000 | avg iter time: 884.06ms | avg tok/sec: 296,524.22 | tokens processed: 52,690,944
2026-02-01 13:39:50,281 - root - INFO - 300 | lr: 0.0001 | loss: 6.4137 | logits loss: 6.0938 | load balance loss: 30.1918 | z loss: 8.0000 | avg iter time: 885.08ms | avg tok/sec: 296,181.29 | tokens processed: 78,905,344
2026-02-01 13:41:19,958 - root - INFO - 400 | lr: 0.0001 | loss: 6.0799 | logits loss: 5.7812 | load balance loss: 30.1965 | z loss: 7.1562 | avg iter time: 889.45ms | avg tok/sec: 294,727.40 | tokens processed: 105,119,744
2026-02-01 13:42:49,615 - root - INFO - 500 | lr: 0.0001 | loss: 5.6703 | logits loss: 5.3750 | load balance loss: 30.1260 | z loss: 6.1875 | avg iter time: 889.34ms | avg tok/sec: 294,761.50 | tokens processed: 131,334,144
2026-02-01 13:44:18,869 - root - INFO - 600 | lr: 0.0002 | loss: 5.3857 | logits loss: 5.0938 | load balance loss: 30.1257 | z loss: 4.3125 | avg iter time: 888.75ms | avg tok/sec: 294,958.64 | tokens processed: 157,548,544
2026-02-01 13:45:48,094 - root - INFO - 700 | lr: 0.0002 | loss: 5.2243 | logits loss: 4.9062 | load balance loss: 30.2017 | z loss: 3.8750 | avg iter time: 884.95ms | avg tok/sec: 296,225.57 | tokens processed: 183,762,944
2026-02-01 13:46:10,119 - root - WARNING - Received KeyboardInterrupt. Exiting...
2026-02-01 13:46:10,417 - root - INFO - Training completed.